Current state-of-the-art classification and detection algorithms rely on supervised training. In this 
work we study unsupervised feature learning in the context of temporally coherent video data. We 
focus on feature learning from unlabeled video data, using the assumption that adjacent video frames 
contain semantically similar information. This assumption is exploited to train a convolutional
pooling auto-encoder regularized by slowness and sparsity. We establish a connection between slow
feature learning and metric learning and show that the trained encoder can be used to define a more
temporally and semantically coherent metric.


Our main assumption is that data samples that are temporal neighbors are also likely to be neighbors 
in the latent space. For example, adjacent frames in a video sequence are more likely to be
semantically similar than non-adjacent frames. This assumption naturally leads to the slowness prior
on features, which was introduced in SFA (Wiskott & Sejnowski (2002)).

Temporal coherence can be exploited by assuming a prior on the features extracted from the temporal
data sequence. One such prior is that the features should vary slowly with respect to time. In the
discrete time setting this prior corresponds to minimizing an $L^p$ norm of the difference of feature
vectors for temporally adjacent inputs. Consider a video sequence with $T$ frames. If $z_t$ represents
the feature vector extracted from the frame at time $t$, then the slowness prior corresponds to
minimizing $\sum_{t=1}^{T} \|z_t - z_{t-1}\|_p$. To avoid the degenerate solution $z_t = z_0$ for
$t = 1 \ldots T$, a second term is introduced which encourages data samples that are not temporal
neighbors to be separated by at least a distance of $m$ units in feature space, where $m$ is known as
the margin. In the temporal setting this corresponds to minimizing
$\max(0, m - \|z_t - z_{t'}\|_p)$, where $|t - t'| > 1$. Together the two terms form the loss
function introduced in Hadsell et al. (2006) as a dimension reduction and data visualization
algorithm known as DrLIM. Assume that there is a differentiable mapping from input space to feature
space which operates on individual temporal samples. Denote this mapping by $G$ and assume it is
parametrized by a set of trainable coefficients denoted by $W$. That is, $z_t = G_W(x_t)$.
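
As a minimal sketch of the slowness prior, the snippet below sums the $L^p$ norms of the differences between consecutive feature vectors. The function name `slowness_penalty` and the `(T, d)` array layout are assumptions made for illustration and are not taken from the paper. It also shows why the prior alone is insufficient: a constant feature sequence attains zero penalty.

```python
import numpy as np

def slowness_penalty(Z, p=2):
    """Sum of ||z_t - z_{t-1}||_p over consecutive rows of Z.

    Z is assumed to have shape (T, d): one feature vector per frame.
    """
    diffs = Z[1:] - Z[:-1]                      # differences of temporal neighbors
    return np.linalg.norm(diffs, ord=p, axis=1).sum()

# Degenerate solution: constant features give zero penalty.
constant = np.ones((5, 3))
varying = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
penalty_constant = slowness_penalty(constant)
penalty_varying = slowness_penalty(varying)
```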

The per-sample loss function can be written as: 


$$
L(x_t, x_{t'}, W) =
\begin{cases}
\|G_W(x_t) - G_W(x_{t'})\|_p, & \text{if } |t - t'| = 1 \\
\max(0, m - \|G_W(x_t) - G_W(x_{t'})\|_p), & \text{if } |t - t'| > 1
\end{cases}
\qquad (1)
$$
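
The two cases of the per-sample loss can be sketched in a few lines, operating directly on precomputed feature vectors rather than on a trained encoder. The function name `contrastive_slowness_loss` and the default values of `m` and `p` are illustrative assumptions. The case $t = t'$ is not covered by the loss as stated, so the sketch rejects it.

```python
import numpy as np

def contrastive_slowness_loss(z_a, z_b, t_a, t_b, m=1.0, p=2):
    """Per-pair loss on features z_a = G_W(x_{t_a}), z_b = G_W(x_{t_b}).

    m (margin) and p (norm order) defaults are illustrative only.
    """
    if t_a == t_b:
        raise ValueError("the loss is only defined for distinct time indices")
    d = np.linalg.norm(z_a - z_b, ord=p)
    if abs(t_a - t_b) == 1:
        return d                  # temporal neighbors: pull features together
    return max(0.0, m - d)        # non-neighbors: push apart up to margin m

z0 = np.array([0.0, 0.0])
near = contrastive_slowness_loss(z0, np.array([3.0, 4.0]), 3, 4)   # adjacent frames
far = contrastive_slowness_loss(z0, np.array([0.3, 0.4]), 3, 7)    # non-adjacent frames
```

Note that non-adjacent pairs already separated by more than `m` contribute nothing, which is why this term alone says little about how informative the features are.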


The second, contrastive term in Equation 1 only acts to avoid the degenerate solution in which $G_W$
is a constant mapping; it does not guarantee that the resulting feature space is informative with
respect to the input. This discriminative criterion depends only on pairwise distances in the
representation space, which is a geometrically weak notion in high dimensions. We propose to replace
this contrastive term with a term that penalizes the reconstruction error of both data samples.
Introducing a reconstruction term not only prevents the constant solution but also acts to explicitly
preserve information about the input. This is a useful property of features which are obtained using
unsupervised learning; since the task to which these features will be applied is not known a priori,
we would like to preserve as much information about the input as possible.


Figure 1: Pooled decoder dictionaries learned without (a) and with (b) the $L_1$ penalty using

Figure 2: Block diagram of the Siamese convolutional model trained on pairs of frames.

What is the optimal architecture of $G_W$ for extracting slow features? Slow features are invariant to
temporal changes by definition. In natural video and on small spatial scales these changes mainly
correspond to local translations and deformations. Invariances to such changes can be achieved
using appropriate pooling operators Bruna & Mallat (2013); LeCun et al. (1998). Such operators are
at the heart of deep convolutional networks (ConvNets), currently the most successful supervised
feature learning architectures (Krizhevsky et al. (2012)). Inspired by these observations, let $G_W$
be a two-stage encoder comprised of a learned, generally over-complete, linear map ($W_e$) and
rectifying nonlinearity $f(\cdot)$, followed by a local pooling. Let the $N$ hidden activations,
$h = f(W_e x)$, be subdivided into $K$ potentially overlapping neighborhoods denoted by $P_i$. Note
that biases are absorbed by expressing the input $x$ in homogeneous coordinates. Feature $z_i$
produced by the encoder for the input at time $t$ can be expressed as
$G_W^i(t) = \|h_t\|_p^{P_i} = \left( \sum_{j \in P_i} |h_{tj}|^p \right)^{1/p}$. Training through a
local pooling operator enforces a local topology on the hidden activations, inducing units that are
pooled together to learn complementary features. In the following experiments we will use $p = 2$.
Although it has recently been shown that it is possible to recover the input when $W_e$ is
sufficiently redundant, reconstructing from these coefficients corresponds to solving a phase
recovery problem Bruna et al. (2014), which is not possible with a simple inverse mapping, such as a
linear map $W_d$. Instead of reconstructing from $z$ we reconstruct from the hidden representation
$h$. This is the same approach taken when training group-sparse auto-encoders Kavukcuoglu et al.
(2009). In order to promote sparse activations in the case of over-complete bases we additionally add
a sparsifying $L_1$ penalty on the hidden activations. Including the rectifying nonlinearity becomes
critical for learning sparse inference in a hugely redundant dictionary, e.g. convolutional
dictionaries Gregor & LeCun (2010).
The complete loss functional is:


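$$
L(x_t, x_{t'}, W) = \sum_{\tau \in \{t, t'\}} \left( \|W_d h_\tau - x_\tau\|^2 + \alpha |h_\tau| \right)
+ \beta \sum_{i=1}^{K} \left| \|h_t\|_p^{P_i} - \|h_{t'}\|_p^{P_i} \right| \qquad (2)
$$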
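
The fully connected model can be sketched as follows, under stated assumptions. The encoder applies a linear map and a rectifier, then pools each neighborhood with an $L^p$ norm. The loss adds the reconstruction error from the hidden activations, the sparsity penalty weighted by `alpha`, and a slowness penalty on the pooled features. The function names, the weight name `beta`, the default values, and the representation of neighborhoods as index lists are all illustrative choices. Biases are omitted rather than absorbed through homogeneous coordinates.

```python
import numpy as np

def encode(x, W_e, groups, p=2):
    """Return hidden activations h = f(W_e x) and pooled features z.

    groups is a list of index lists, one per pooling neighborhood P_i
    (an assumed representation; neighborhoods may overlap).
    """
    h = np.maximum(W_e @ x, 0.0)                          # rectifying nonlinearity
    z = np.array([np.linalg.norm(h[g], ord=p) for g in groups])
    return h, z

def coherent_autoencoder_loss(x_t, x_tp, W_e, W_d, groups,
                              alpha=0.1, beta=1.0, p=2):
    """Reconstruction + L1 sparsity + slowness for one pair of frames."""
    h_t, z_t = encode(x_t, W_e, groups, p)
    h_tp, z_tp = encode(x_tp, W_e, groups, p)
    # Reconstruct from h (not from z) with a linear decoder W_d.
    recon = np.sum((W_d @ h_t - x_t) ** 2) + np.sum((W_d @ h_tp - x_tp) ** 2)
    sparsity = alpha * (np.abs(h_t).sum() + np.abs(h_tp).sum())
    slowness = beta * np.abs(z_t - z_tp).sum()            # on pooled features only
    return recon + sparsity + slowness

# Toy pair: the hidden units swap values, so the pooled feature is unchanged.
I2 = np.eye(2)
loss = coherent_autoencoder_loss(np.array([1.0, 2.0]), np.array([2.0, 1.0]),
                                 I2, I2, groups=[[0, 1]], alpha=0.5, beta=1.0)
```

In the toy pair the hidden activations change between frames while the pooled feature stays constant, so only the sparsity term contributes. This is the sense in which pooling lets the features vary slowly while the hidden code retains the information needed for reconstruction.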


Figure 2 shows a convolutional version of the proposed architecture and loss. By replacing all linear
operators in our model with convolutional filter banks and including spatial pooling, translation
invariance need not be learned LeCun et al. (1998). In all other respects the convolutional model is
conceptually identical to the fully connected model described in the previous section.
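
A toy one-dimensional example illustrates why convolution followed by spatial pooling yields translation invariance without learning it. This sketch makes two simplifying assumptions that go beyond the text: shifts are circular, and pooling is global over all positions. With local pooling, the invariance holds only approximately and for small shifts. All names below are hypothetical.

```python
import numpy as np

def circular_correlate(x, w):
    """out[i] = sum_k w[k] * x[(i + k) mod n], for a filter with len(w) <= len(x)."""
    n = len(x)
    return np.array([np.dot(w, np.roll(x, -i)[:len(w)]) for i in range(n)])

def conv_pool_features(x, filters, p=2):
    """One feature per filter: rectify the filter response, then pool over all positions."""
    feats = []
    for w in filters:
        h = np.maximum(circular_correlate(x, w), 0.0)   # shifts with the input
        feats.append(np.linalg.norm(h, ord=p))          # pooling discards position
    return np.array(feats)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)             # toy "frame"
filters = rng.standard_normal((3, 5))   # random, untrained filter bank
z_original = conv_pool_features(x, filters)
z_shifted = conv_pool_features(np.roll(x, 4), filters)  # translated input
```

Because the filter responses of the shifted input are a shifted copy of the original responses, the pooled features `z_original` and `z_shifted` agree up to floating-point error, even though the filters were never trained.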